
LCORE-1723: Cross-Encoder Reranking for enhanced RAG#1566

Open
Anxhela21 wants to merge 3 commits into lightspeed-core:main from Anxhela21:anx/vector_compare

Conversation


@Anxhela21 Anxhela21 commented Apr 21, 2026

Description

Cross-Encoder Reranking for Enhanced RAG Quality

Summary

Implemented neural reranking using cross-encoder models to improve RAG chunk relevance scoring. This enhancement replaces simple vector similarity with semantic relevance scoring when combining results from multiple knowledge sources (BYOK + OKP) that use different search types (e.g., hybrid, semantic).

Key Features

  • Cross-encoder model integration using sentence-transformers
  • Configurable model selection (default: cross-encoder/ms-marco-MiniLM-L6-v2)
  • Query-document pair scoring for improved semantic relevance
  • Efficient model caching to avoid reloading between requests

Flexible Configuration

```yaml
reranker:
  enabled: true
  model: "cross-encoder/ms-marco-MiniLM-L6-v2"
  top_k_multiplier: 2.0  # Fetch 2x chunks, rerank, keep top_k
  byok_boost: 1.2        # Boost factor for BYOK sources
  okp_boost: 1.0         # Boost factor for OKP sources
```
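As an illustration, the same settings could be modeled like this (a plain dataclass stand-in; the project's actual configuration uses Pydantic models, and `candidate_pool` is a hypothetical helper, not part of the PR):

```python
from dataclasses import dataclass


@dataclass
class RerankerConfig:
    """Illustrative stand-in mirroring the YAML keys above."""

    enabled: bool = False
    model: str = "cross-encoder/ms-marco-MiniLM-L6-v2"
    top_k_multiplier: float = 2.0  # fetch top_k * multiplier candidates before reranking
    byok_boost: float = 1.2        # boost factor for BYOK sources
    okp_boost: float = 1.0         # boost factor for OKP sources

    def candidate_pool(self, top_k: int) -> int:
        # Size of the oversampled candidate pool to fetch before reranking.
        return int(top_k * self.top_k_multiplier)


cfg = RerankerConfig(enabled=True)
print(cfg.candidate_pool(10))  # 20
```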

Auto-Enable

  • Automatically enables when both BYOK and OKP RAG sources are configured
  • Applies only when the user hasn't explicitly configured reranker settings (see above)

Error Handling

  • Fallback to original vector scores if reranking fails
  • Lazy model loading with error recovery
  • Model caching
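The fallback behavior can be illustrated with a small sketch (the dict-based chunk shape and the helper names are assumptions for illustration; the real code works on RAGChunk objects):

```python
def safe_rerank(chunks, rerank_fn, top_k):
    """Try reranking; on any failure, fall back to original vector scores."""
    try:
        return rerank_fn(chunks)[:top_k]
    except Exception:
        # Fallback: keep the original ordering by vector similarity score.
        ordered = sorted(
            chunks,
            key=lambda c: c.get("score", float("-inf")),
            reverse=True,
        )
        return ordered[:top_k]


def failing(_chunks):
    raise RuntimeError("model not available")


chunks = [{"id": "a", "score": 0.2}, {"id": "b", "score": 0.9}]
print(safe_rerank(chunks, failing, top_k=1))  # [{'id': 'b', 'score': 0.9}]
```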

Technical Implementation

Modified Files

  • src/utils/vector_search.py - Core reranking logic and pipeline integration
  • src/models/config.py - Configuration schema and validation
  • src/configuration.py - Configuration property access
  • src/constants.py - Default boost factors and settings
  • pyproject.toml - Added sentence-transformers dependency

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change
  • Unit tests improvement
  • Integration tests improvement
  • End to end tests improvement
  • Benchmarks improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: (e.g., Claude, CodeRabbit, Ollama, etc., N/A if not used)
  • Generated by: (e.g., tool name and version; N/A if not used)

Related Tickets & Documents

  • Related Issue #
  • Closes #

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • New Features
    • Improved RAG relevance with optional cross-encoder chunk reranking.
    • New reranker settings to customize model, top-K, and boost factors.
    • BYOK chunks receive adjustable score boosting for higher relevance.
    • Automatic reranker activation when both BYOK and OKP RAG sources are configured.
    • Robust fallback: retrieval remains stable if the reranker cannot load or fails.


coderabbitai Bot commented Apr 21, 2026

Warning

Rate limit exceeded

@Anxhela21 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 22 minutes and 32 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4a541b95-3282-4eca-ab95-12ea02bfe20c

📥 Commits

Reviewing files that changed from the base of the PR and between d664680 and 9eb794a.

📒 Files selected for processing (4)
  • pyproject.toml
  • src/constants.py
  • src/models/config.py
  • tests/unit/models/config/test_reranker_configuration.py

Walkthrough

Adds a cross-encoder reranking pipeline: new reranker configuration model and AppConfig property, sentence-transformers dependency (duplicated in pyproject), BYOK rerank boost constant, cached CrossEncoder loading, score normalization and BYOK boosting, refactors chunk retrieval and integrates reranking into build_rag_context.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Dependencies**<br>`pyproject.toml` | Inserts `sentence-transformers>=2.0.0` into `[project].dependencies` (appears twice / duplicated in the dependency array). |
| **Configuration Models**<br>`src/models/config.py`, `src/configuration.py` | Adds `RerankerConfiguration` model (`enabled`, `model`, `top_k_multiplier`, `byok_boost`, `okp_boost`, internal `_explicitly_configured`) and a post-validator that auto-enables the reranker when BYOK and OKP RAG are both present unless explicitly configured. Exposes the `AppConfig.reranker` property. |
| **Constants**<br>`src/constants.py` | Adds `BYOK_RAG_RERANK_BOOST = 1.2`. |
| **RAG Pipeline / Vector Search**<br>`src/utils/vector_search.py` | Implements lazy, cached `sentence_transformers.CrossEncoder` loading; reranks candidate (query, chunk.content) pairs; min-max normalizes scores; supports top_k limiting and fallback to original scores on errors; applies BYOK-specific score boosting; refactors Solr/BYOK retrieval to accept per-call limits; fetches larger candidate pools, merges, optionally reranks, and rebuilds referenced documents from the final selected chunks. |
| **Tests**<br>`tests/unit/models/config/test_reranker_configuration.py`, `tests/unit/utils/test_vector_search.py` | Adds unit tests for `RerankerConfiguration` defaults/validation and extensive tests for cross-encoder loading, caching, reranking behavior, normalization, fallback handling, and BYOK boost application. |
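The min-max normalization mentioned in the walkthrough can be sketched in isolation (pure Python; the function name is illustrative, but the edge-case choices mirror the PR's described behavior):

```python
def min_max_normalize(scores: list[float]) -> list[float]:
    """Scale scores to [0, 1]; identical scores map to 0.5, a single score to 1.0."""
    if len(scores) <= 1:
        return [1.0] * len(scores)
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores are identical; no ordering information, assign 0.5 to all.
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


print(min_max_normalize([2.0, -1.0, 0.5]))  # [1.0, 0.0, 0.5]
```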

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Builder as build_rag_context
    participant BYOK as _fetch_byok_chunks
    participant Solr as _fetch_solr_chunks
    participant Merge as MergeCandidates
    participant Reranker as CrossEncoderReranker
    participant Booster as BYOKBooster
    participant Selector as SelectTopK
    participant Context as BuildRAGContext

    Client->>Builder: query, configuration
    Builder->>BYOK: fetch (max_chunks = 2×BYOK_MAX)
    BYOK-->>Builder: byok_chunks
    Builder->>Solr: fetch (max_chunks = 2×BYOK_MAX)
    Solr-->>Builder: solr_chunks
    Builder->>Merge: combine candidates
    Merge-->>Builder: merged_candidates
    alt configuration.reranker.enabled
        Builder->>Reranker: (query, merged_candidates)
        Reranker->>Reranker: predict scores for (query, chunk.content)
        Reranker->>Reranker: min-max normalize -> [0,1]
        Reranker-->>Builder: reranked_chunks
        Builder->>Booster: apply BYOK boost (multiply non-OKP scores)
        Booster-->>Builder: boosted_chunks
        Builder->>Selector: sort & select top BYOK_MAX
    else
        Builder->>Selector: sort by original scores & select top BYOK_MAX
    end
    Selector-->>Builder: final_chunks
    Builder->>Context: build referenced_documents from final_chunks
    Context-->>Client: RAGContext

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~55 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main feature: implementing cross-encoder reranking to enhance RAG, which is the primary objective of this PR. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 98.00%, which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



  • Anxhela21 changed the title from "cross encoder and reranker for byok and solr" to "Cross-Encoder Reranking for Enhanced RAG Quality" (Apr 27, 2026)
  • Anxhela21 force-pushed the anx/vector_compare branch from e6b6d66 to 880a566 (April 27, 2026 15:42)
  • Anxhela21 marked this pull request as ready for review (April 27, 2026 17:22)
  • Anxhela21 changed the title from "Cross-Encoder Reranking for Enhanced RAG Quality" to "[LCORE-1723] Cross-Encoder Reranking for Enhanced RAG Quality" (Apr 27, 2026)
  • Anxhela21 changed the title from "[LCORE-1723] Cross-Encoder Reranking for Enhanced RAG Quality" to "[LCORE-1723] Cross-Encoder Reranking" (Apr 27, 2026)
  • Anxhela21 changed the title from "[LCORE-1723] Cross-Encoder Reranking" to "[LCORE-1723] Cross-Encoder Reranking for enhanced RAG" (Apr 27, 2026)
  • Anxhela21 force-pushed the anx/vector_compare branch from 0131555 to 82caaf6 (April 27, 2026 17:38)

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pyproject.toml`:
- Around line 75-79: Remove the duplicate dependency entry for
"sentence-transformers>=2.0.0" in pyproject.toml (the repeated listing with the
same comment), leaving only a single declaration; also check the llslibdev group
requirement (">=5.0.0") and confirm whether the runtime floor should be raised
to ">=5.0.0" to match required features, updating the single remaining
"sentence-transformers" entry accordingly if the code needs 5.0.0+.

In `@src/constants.py`:
- Around line 195-196: Add a Final[float] type annotation to the new constant
BYOK_RAG_RERANK_BOOST so it matches the other constants' immutability checks;
update the declaration for BYOK_RAG_RERANK_BOOST to use Final[float] and ensure
Final is imported from typing at the top of the file if it isn't already.

In `@src/models/config.py`:
- Around line 1828-1866: The current PrivateAttr(_explicitly_configured: bool =
PrivateAttr(default=False)) makes _explicitly_configured always present and the
value-comparison in mark_as_explicitly_configured() misses cases where a user
explicitly provided a field equal to its default; change _explicitly_configured
to a tri-state (e.g. PrivateAttr(default=None)) and rewrite
mark_as_explicitly_configured to use the Pydantic-provided
self.model_fields_set: compute the set of reranker field names you care about
(enabled, model, model_id, provider_id, top_k_multiplier, byok_boost,
okp_boost), check if any of those names are in self.model_fields_set and if so
set self._explicitly_configured = True (leave it None/False otherwise), and
update the consuming code (the hasattr(...) check elsewhere) to test for
_explicitly_configured is not None / True instead of hasattr so the auto-enable
branch can run correctly.
- Around line 1815-1826: The three RerankerConfiguration fields
(top_k_multiplier, byok_boost, okp_boost) are declared but not used; update the
reranking code to read these values from the configuration instead of using
hard-coded constants: replace the hard-coded 2x in the pool_size calculation in
vector_search.py with configuration.reranker.top_k_multiplier when computing
pool_size (or its equivalent variable), change the boost applied in
_apply_byok_rerank_boost to use configuration.reranker.byok_boost instead of
constants.BYOK_RAG_RERANK_BOOST, and ensure OKP chunks receive
configuration.reranker.okp_boost (apply the multiplier where OKP chunks
currently pass through unmodified). Keep existing defaults from the
RerankerConfiguration when reading these fields so behavior is unchanged if the
user omits them.

In `@src/utils/vector_search.py`:
- Around line 28-56: The module-level cache _cross_encoder_models and loader
_get_cross_encoder are not concurrency-safe and can instantiate
CrossEncoder(model_name) multiple times on cold start; fix by serializing loads
for a given model_name (single-flight) — e.g., add a module-level dict of
asyncio.Lock objects (or a per-name asyncio.Event/Future) and acquire the lock
for model_name inside _get_cross_encoder (or convert the loader to a synchronous
function, apply functools.cache, and call it via asyncio.to_thread) so only one
task constructs CrossEncoder(model_name) and others await the result, then store
the loaded model (or None on failure) into _cross_encoder_models before
releasing the lock; ensure any exception handling still sets the cache to None
to avoid repeated retries.
- Around line 722-728: The function build_rag_context currently takes an unused
parameter moderation_decision (annotated with # pylint:
disable=unused-argument); either remove that parameter from the function
signature and delete the pylint disable, and update its three callers (the calls
in the functions at src/app/endpoints/query.py,
src/app/endpoints/streaming_query.py, and src/app/endpoints/responses.py) to
stop passing moderation_decision, or if you intend to use it later, keep the
parameter but replace the pylint disable with a short TODO comment in
build_rag_context explaining the planned future use and why moderation_decision
is intentionally ignored; reference the function name build_rag_context and the
three caller sites when making the change so the signature and call sites remain
consistent.
- Around line 59-134: The async function _rerank_chunks_with_cross_encoder
currently calls the blocking sentence_transformers.CrossEncoder.predict
synchronously (model.predict), which will block the asyncio event loop; change
the call to run in a thread via asyncio.to_thread (e.g., scores = await
asyncio.to_thread(model.predict, pairs)) so inference executes off the event
loop, preserving async responsiveness, and keep the subsequent score processing
logic unchanged; ensure to import asyncio if not already present.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 89ed94e6-8009-41d5-8e4c-3f2892881c37

📥 Commits

Reviewing files that changed from the base of the PR and between 2d0c79a and 82caaf6.

📒 Files selected for processing (5)
  • pyproject.toml
  • src/configuration.py
  • src/constants.py
  • src/models/config.py
  • src/utils/vector_search.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: bandit
  • GitHub Check: build-pr
  • GitHub Check: unit_tests (3.12)
  • GitHub Check: integration_tests (3.12)
  • GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Use absolute imports for internal modules: from authentication import get_auth_dependency
Import FastAPI dependencies with: from fastapi import APIRouter, HTTPException, Request, status, Depends
Import Llama Stack client with: from llama_stack_client import AsyncLlamaStackClient
Check constants.py for shared constants before defining new ones
All modules start with descriptive docstrings explaining purpose
Use logger = get_logger(__name__) from log.py for module logging
Type aliases defined at module level for clarity
Use Final[type] as type hint for all constants
All functions require docstrings with brief descriptions
Complete type annotations for parameters and return types in functions
Use typing_extensions.Self for model validators in Pydantic models
Use modern union type syntax str | int instead of Union[str, int]
Use Optional[Type] for optional type hints
Use snake_case with descriptive, action-oriented function names (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead
Use async def for I/O operations and external API calls
Handle APIConnectionError from Llama Stack in error handling
Use standard log levels with clear purposes: debug, info, warning, error
All classes require descriptive docstrings explaining purpose
Use PascalCase for class names with standard suffixes: Configuration, Error/Exception, Resolver, Interface
Use ABC for abstract base classes with @abstractmethod decorators
Use @model_validator and @field_validator for Pydantic model validation
Complete type annotations for all class attributes; use specific types, not Any
Follow Google Python docstring conventions with Parameters, Returns, Raises, and Attributes sections

Files:

  • src/constants.py
  • src/configuration.py
  • src/models/config.py
  • src/utils/vector_search.py
**/constants.py

📄 CodeRabbit inference engine (AGENTS.md)

Central constants.py for shared constants with descriptive comments

Files:

  • src/constants.py
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Pydantic models extend ConfigurationBase for config, BaseModel for data models

Files:

  • src/constants.py
  • src/configuration.py
  • src/models/config.py
  • src/utils/vector_search.py
src/**/config*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/config*.py: All config uses Pydantic models extending ConfigurationBase
Base class sets extra="forbid" to reject unknown fields in Pydantic models
Use @field_validator and @model_validator for custom validation in Pydantic models
Use type hints like Optional[FilePath], PositiveInt, SecretStr in Pydantic models

Files:

  • src/configuration.py
  • src/models/config.py
🧠 Learnings (6)
📚 Learning: 2026-03-17T11:34:53.242Z
Learnt from: are-ces
Repo: lightspeed-core/lightspeed-stack PR: 1335
File: docs/config.md:411-411
Timestamp: 2026-03-17T11:34:53.242Z
Learning: In the lightspeed-stack project (`src/models/config.py`, `docs/config.md`), the internal Solr filter `is_chunk:true` (defined as `SOLR_CHUNK_FILTER_QUERY` in `src/constants.py`) is always injected by the system for OKP searches and is intentionally hidden from users. The `chunk_filter_query` field in `OkpConfiguration` is user-facing and additive-only, but the documentation must NOT mention the internal `is_chunk:true` behavior — this is a deliberate design decision by the maintainers.

Applied to files:

  • src/constants.py
  • src/utils/vector_search.py
📚 Learning: 2026-04-19T15:40:25.624Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-04-19T15:40:25.624Z
Learning: Applies to src/**/config*.py : All config uses Pydantic models extending `ConfigurationBase`

Applied to files:

  • src/models/config.py
📚 Learning: 2026-04-19T15:40:25.624Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-04-19T15:40:25.624Z
Learning: Applies to src/**/*.py : Pydantic models extend `ConfigurationBase` for config, `BaseModel` for data models

Applied to files:

  • src/models/config.py
📚 Learning: 2026-01-12T10:58:40.230Z
Learnt from: blublinsky
Repo: lightspeed-core/lightspeed-stack PR: 972
File: src/models/config.py:459-513
Timestamp: 2026-01-12T10:58:40.230Z
Learning: In lightspeed-core/lightspeed-stack, for Python files under src/models, when a user claims a fix is done but the issue persists, verify the current code state before accepting the fix. Steps: review the diff, fetch the latest changes, run relevant tests, reproduce the issue, search the codebase for lingering references to the original problem, confirm the fix is applied and not undone by subsequent commits, and validate with local checks to ensure the issue is resolved.

Applied to files:

  • src/models/config.py
📚 Learning: 2026-02-25T07:46:33.545Z
Learnt from: asimurka
Repo: lightspeed-core/lightspeed-stack PR: 1211
File: src/models/responses.py:8-16
Timestamp: 2026-02-25T07:46:33.545Z
Learning: In the Python codebase, requests.py should use OpenAIResponseInputTool as Tool while responses.py uses OpenAIResponseTool as Tool. This difference is intentional due to differing schemas for input vs output tools in llama-stack-api. Apply this distinction consistently to other models under src/models (e.g., ensure request-related tools use the InputTool variant and response-related tools use the ResponseTool variant). If adding new tools, choose the corresponding InputTool or Tool class based on whether the tool represents input or output, and document the rationale in code comments.

Applied to files:

  • src/models/config.py
📚 Learning: 2026-01-30T13:33:40.479Z
Learnt from: are-ces
Repo: lightspeed-core/lightspeed-stack PR: 1073
File: src/app/endpoints/query_v2.py:575-586
Timestamp: 2026-01-30T13:33:40.479Z
Learning: In `src/app/endpoints/query_v2.py`, the `parse_referenced_documents_from_responses_api` function filters referenced documents to include only those with `doc_title` or `doc_url` because these documents are displayed in the frontend. Documents with only `doc_id` are intentionally excluded as they wouldn't provide useful information to end users.

Applied to files:

  • src/utils/vector_search.py
🔇 Additional comments (3)
src/utils/vector_search.py (1)

722-746: pool_size/top_k ignore the configured top_k_multiplier.

pool_size = 2 * constants.BYOK_RAG_MAX_CHUNKS hard-codes the 2× factor, but RerankerConfiguration.top_k_multiplier (default 2.0) is intended to govern this exact value. Likewise, _apply_byok_rerank_boost(...) reads the multiplier from constants.BYOK_RAG_RERANK_BOOST, ignoring reranker.byok_boost and reranker.okp_boost. See the comment on src/models/config.py RerankerConfiguration for the full picture and a suggested wiring diff.
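The wiring this comment suggests could be sketched roughly as follows (`RerankerCfg`, `pool_size`, and `apply_source_boost` are illustrative stand-ins, not the project's actual names):

```python
import math
from dataclasses import dataclass


@dataclass
class RerankerCfg:
    top_k_multiplier: float = 2.0
    byok_boost: float = 1.2
    okp_boost: float = 1.0


def pool_size(top_k: int, cfg: RerankerCfg) -> int:
    # Fetch a larger candidate pool before reranking, sized by the
    # configured multiplier rather than a hard-coded 2x factor.
    return math.ceil(top_k * cfg.top_k_multiplier)


def apply_source_boost(score: float, source: str, cfg: RerankerCfg) -> float:
    # Multiply the normalized score by the configured per-source boost factor.
    return score * (cfg.byok_boost if source == "byok" else cfg.okp_boost)


cfg = RerankerCfg()
print(pool_size(10, cfg))                    # 20
print(apply_source_boost(0.5, "byok", cfg))  # 0.6
```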

src/configuration.py (1)

469-475: LGTM.

The new reranker property follows the same pattern as the surrounding rag/okp properties. Since reranker is a non-optional field with a default_factory in Configuration, the access is safe once _configuration is loaded.

src/models/config.py (1)

1822-1823: Remove unused placeholder fields or wire them up.

model_id and provider_id in RerankerConfiguration are not consumed by src/utils/vector_search.py (only model and enabled are read). If these fields are reserved for future Llama Stack-backed reranker provider support, they should either be wired up or removed to avoid leaking unimplemented schema to users. The Pydantic v2 protected namespace concern does not apply here—the project uses Pydantic v2.10.6+, where the default protected_namespaces setting only flags direct conflicts with methods like model_validate and model_dump, not fields with the model_ prefix.

> Likely an incorrect or invalid review comment.

Collapsed comment threads:
  • pyproject.toml (outdated)
  • src/constants.py (outdated)
  • src/models/config.py (outdated)
  • src/models/config.py
Comment on lines +28 to +56
```python
# Lazy-loaded cross-encoder models for reranking RAG chunks (CPU-bound, use in thread).
# Cache models by name to avoid reloading the same model multiple times.
# Not a constant; pylint invalid-name is disabled for this module-level singleton.
_cross_encoder_models: dict[str, Any] = {}  # pylint: disable=invalid-name


def _get_cross_encoder(model_name: str) -> Any:
    """Return the lazy-loaded cross-encoder model for reranking.

    Args:
        model_name: Name of the cross-encoder model to load.

    Returns:
        Loaded CrossEncoder model instance, or None if loading fails.
    """
    if model_name not in _cross_encoder_models:
        try:
            from sentence_transformers import (  # pylint: disable=import-outside-toplevel
                CrossEncoder,
            )

            _cross_encoder_models[model_name] = CrossEncoder(model_name)
            logger.info("Loaded cross-encoder for RAG reranking: %s", model_name)
        except Exception as e:  # pylint: disable=broad-exception-caught
            logger.warning(
                "Could not load cross-encoder for reranking (%s): %s", model_name, e
            )
            _cross_encoder_models[model_name] = None
    return _cross_encoder_models[model_name]
```

@coderabbitai coderabbitai Bot Apr 27, 2026


⚠️ Potential issue | 🟠 Major

Cross-encoder cache is not safe against concurrent cold-start loads.

_cross_encoder_models is a plain module-level dict. With multiple concurrent requests arriving while the cache is cold (e.g. right after process start), each await-suspended task can re-enter _get_cross_encoder before any prior task has populated the cache, causing CrossEncoder(model_name) to be instantiated several times — wasteful CPU/RAM on startup and possible OOM if the model is large.

A single-flight pattern with asyncio.Lock (or functools.cache if you make the loader sync and only call it via asyncio.to_thread) avoids this:

♻️ Single-flight loader sketch
```diff
-_cross_encoder_models: dict[str, Any] = {}  # pylint: disable=invalid-name
+_cross_encoder_models: dict[str, Any] = {}  # pylint: disable=invalid-name
+_cross_encoder_load_lock = asyncio.Lock()
+
+
+async def _get_cross_encoder(model_name: str) -> Any:
+    if model_name in _cross_encoder_models:
+        return _cross_encoder_models[model_name]
+    async with _cross_encoder_load_lock:
+        if model_name in _cross_encoder_models:
+            return _cross_encoder_models[model_name]
+        try:
+            from sentence_transformers import CrossEncoder  # pylint: disable=import-outside-toplevel
+            model = await asyncio.to_thread(CrossEncoder, model_name)
+            _cross_encoder_models[model_name] = model
+            logger.info("Loaded cross-encoder for RAG reranking: %s", model_name)
+        except Exception as e:  # pylint: disable=broad-exception-caught
+            logger.warning("Could not load cross-encoder (%s): %s", model_name, e)
+            _cross_encoder_models[model_name] = None
+    return _cross_encoder_models[model_name]
```

Note: when running uvicorn with workers > 1, each process still gets its own copy of the model — that is unavoidable here, but worth keeping in mind for memory budgeting.



@syedriko or @tisnik do you think this is a valid concern?


Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Comment on lines +59 to +134
```python
async def _rerank_chunks_with_cross_encoder(
    query: str,
    chunks: list[RAGChunk],
    top_k: int,
    model_name: str = "cross-encoder/ms-marco-MiniLM-L6-v2",
) -> list[RAGChunk]:
    """Rerank chunks using configurable cross-encoder model.

    Args:
        query: The search query
        chunks: RAG chunks to rerank
        top_k: Number of top chunks to return
        model_name: Cross-encoder model name to use

    Returns:
        Top top_k chunks sorted by cross-encoder score (descending)
    """
    if not chunks:
        return []

    try:
        # Get the cached cross-encoder model
        model = _get_cross_encoder(model_name)
        if model is None:
            raise RuntimeError(f"Failed to load cross-encoder model: {model_name}")

        logger.debug("Using cross-encoder model: %s", model_name)

        # Create query-chunk pairs for scoring
        pairs = [(query, chunk.content) for chunk in chunks]
        scores = model.predict(pairs)

        if hasattr(scores, "tolist"):
            scores = scores.tolist()

        # Normalize scores to [0,1] range using min-max normalization
        if len(scores) > 1:
            min_score = min(scores)
            max_score = max(scores)
            score_range = max_score - min_score
            if score_range > 0:
                normalized_scores = [
                    (score - min_score) / score_range for score in scores
                ]
            else:
                # All scores are identical, assign 0.5 to all
                normalized_scores = [0.5] * len(scores)
        else:
            # Single score, assign 1.0
            normalized_scores = [1.0] * len(scores)

        # Combine normalized scores with chunks and sort by score (descending)
        indexed = list(zip(normalized_scores, chunks, strict=True))
        indexed.sort(key=lambda x: x[0], reverse=True)
        top_indexed = indexed[:top_k]

        # Return RAGChunk list with normalized cross-encoder scores [0,1]
        return [
            RAGChunk(
                content=chunk.content,
                source=chunk.source,
                score=float(score),
                attributes=chunk.attributes,
            )
            for score, chunk in top_indexed
        ]

    except Exception as e:  # pylint: disable=broad-exception-caught
        logger.warning(
            "Cross-encoder reranking failed, falling back to original scoring: %s", e
        )
        # Fallback: sort by original score and take top_k
        sorted_chunks = sorted(
            chunks,
            key=lambda c: c.score if c.score is not None else float("-inf"),
            reverse=True,
        )
        return sorted_chunks[:top_k]
```

@coderabbitai coderabbitai Bot Apr 27, 2026


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

sentence-transformers CrossEncoder predict synchronous blocking

💡 Result:

The CrossEncoder.predict method in sentence-transformers performs synchronous, blocking inference by default. It computes similarity scores for input sentence pairs in a blocking manner on the specified device(s), returning results only after completion. Key details:

  • Input: a list of sentence pairs, e.g., [['sentence1a', 'sentence1b'], ['sentence2a', 'sentence2b']].
  • Default behavior: batched processing (batch_size=32), optional progress bar; returns a numpy array or torch.Tensor of scores.
  • Blocking: no asynchronous option mentioned; the call waits for computation to finish.
  • Multi-device: supports multiprocessing via a device list (e.g., ["cuda:0", "cuda:1"]) or a manual pool for parallel GPUs/CPUs, but remains synchronous from the caller's perspective.
  • Example:

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
scores = model.predict([('Query1', 'Doc1'), ('Query1', 'Doc2')])
# Returns an array of scores after computation completes.
```

This is confirmed in official documentation as the standard synchronous prediction workflow.


🏁 Scripts executed (verification against the lightspeed-core/lightspeed-stack repository):

```shell
# Locate the file and confirm line numbers and context
fd -t f "vector_search.py" | head -5
# Inspect the reranking function and the _get_cross_encoder implementation
cat -n src/utils/vector_search.py | sed -n '59,100p'
cat -n src/utils/vector_search.py | sed -n '34,56p'
# Check the Python version requirement
cat pyproject.toml | grep -A 5 "python"
[ -f setup.py ] && cat setup.py | head -30 || echo "No setup.py found"
# Verify the "CPU-bound, use in thread" comment and existing imports
cat -n src/utils/vector_search.py | sed -n '1,35p'
head -30 src/utils/vector_search.py | grep -E "^import|^from"
```


model.predict() is synchronous and blocks the event loop inside an async function.

sentence_transformers.CrossEncoder.predict is a CPU-bound PyTorch call that runs synchronously. Calling it directly at line 89 inside async def _rerank_chunks_with_cross_encoder blocks the asyncio event loop for the entire inference duration — preventing the worker from handling other concurrent requests. The module-level comment at line 28 explicitly notes "CPU-bound, use in thread", yet the code does not implement this.

Use asyncio.to_thread (available in Python 3.9+, and your project requires ≥3.12) to offload the blocking inference so the event loop remains responsive.

Proposed fix
         # Create query-chunk pairs for scoring
         pairs = [(query, chunk.content) for chunk in chunks]
-        scores = model.predict(pairs)
+        scores = await asyncio.to_thread(model.predict, pairs)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/vector_search.py` around lines 59 - 134, The async function
_rerank_chunks_with_cross_encoder currently calls the blocking
sentence_transformers.CrossEncoder.predict synchronously (model.predict), which
will block the asyncio event loop; change the call to run in a thread via
asyncio.to_thread (e.g., scores = await asyncio.to_thread(model.predict, pairs))
so inference executes off the event loop, preserving async responsiveness, and
keep the subsequent score processing logic unchanged; ensure to import asyncio
if not already present.

Contributor
This seems a valid concern

Contributor
@are-ces, agreed — this is a valid issue. The fix is straightforward since asyncio is already imported. Here's the minimal change:

         # Create query-chunk pairs for scoring
         pairs = [(query, chunk.content) for chunk in chunks]
-        scores = model.predict(pairs)
+        scores = await asyncio.to_thread(model.predict, pairs)

As a secondary note, there's also a cold-start blocking concern: on the first call for a given model_name, _get_cross_encoder calls CrossEncoder(model_name) synchronously (model download + load), which also blocks the event loop. You may want to wrap that call in asyncio.to_thread as well:

-        model = _get_cross_encoder(model_name)
+        model = await asyncio.to_thread(_get_cross_encoder, model_name)

The module-level comment even acknowledges this with # CPU-bound, use in thread, so both fixes align with the documented intent.

Comment on lines +722 to 728
```diff
 async def build_rag_context(  # pylint: disable=too-many-locals
     client: AsyncLlamaStackClient,
-    moderation_decision: str,
+    moderation_decision: str,  # pylint: disable=unused-argument
     query: str,
     vector_store_ids: Optional[list[str]],
     solr: Optional[SolrVectorSearchRequest] = None,
 ) -> RAGContext:
```

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Scripts executed (verification against the lightspeed-core/lightspeed-stack repository):

```shell
#!/bin/bash
# Find all call sites of build_rag_context to confirm impact of removing the parameter
rg -nP --type=py -C2 '\bbuild_rag_context\s*\('
# Read the build_rag_context implementation
sed -n '722,800p' src/utils/vector_search.py
# Check the exact context at each call site
sed -n '170,185p' src/app/endpoints/query.py
sed -n '228,245p' src/app/endpoints/streaming_query.py
sed -n '343,360p' src/app/endpoints/responses.py
# Verify that moderation_decision is not used anywhere in the function body
rg -A30 'async def build_rag_context' src/utils/vector_search.py | grep -i 'moderation_decision'
```


Remove unused moderation_decision parameter or document its intended future use.

The parameter is annotated with # pylint: disable=unused-argument and is not referenced anywhere in the function body (nor in the docstring). The function builds RAG context without consulting moderation status, since moderation checks are already performed by callers before deciding whether to proceed. Remove the parameter and update the three call sites:

  • src/app/endpoints/query.py:179
  • src/app/endpoints/streaming_query.py:234
  • src/app/endpoints/responses.py:349

Alternatively, if there is planned future use, add a TODO comment explaining the intent instead of silently passing the query to vector stores for queries that may have been blocked.

Suggested signature cleanup
 async def build_rag_context(  # pylint: disable=too-many-locals
     client: AsyncLlamaStackClient,
-    moderation_decision: str,  # pylint: disable=unused-argument
     query: str,
     vector_store_ids: Optional[list[str]],
     solr: Optional[SolrVectorSearchRequest] = None,
 ) -> RAGContext:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/vector_search.py` around lines 722 - 728, The function
build_rag_context currently takes an unused parameter moderation_decision
(annotated with # pylint: disable=unused-argument); either remove that parameter
from the function signature and delete the pylint disable, and update its three
callers (the calls in the functions at src/app/endpoints/query.py,
src/app/endpoints/streaming_query.py, and src/app/endpoints/responses.py) to
stop passing moderation_decision, or if you intend to use it later, keep the
parameter but replace the pylint disable with a short TODO comment in
build_rag_context explaining the planned future use and why moderation_decision
is intentionally ignored; reference the function name build_rag_context and the
three caller sites when making the change so the signature and call sites remain
consistent.

@Anxhela21 Anxhela21 changed the title [LCORE-1723] Cross-Encoder Reranking for enhanced RAG LCORE-1723: Cross-Encoder Reranking for enhanced RAG Apr 27, 2026
@Anxhela21 Anxhela21 force-pushed the anx/vector_compare branch from cce6b82 to d664680 Compare April 27, 2026 18:56
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/utils/vector_search.py (1)

642-685: ⚠️ Potential issue | 🟠 Major

The OKP fetch path never actually expands the rerank candidate pool.

_fetch_solr_rag() accepts max_chunks, but the value never reaches vector_io.query() because _build_query_params(solr) still uses the default k. build_rag_context() also calls _fetch_solr_rag() without max_chunks=pool_size. As a result, OKP stays capped at the default retrieval size, so the reranker never sees the larger Solr pool described in this PR.

♻️ Suggested fix
-            params = _build_query_params(solr)
+            params = _build_query_params(solr, k=limit)
-    solr_chunks_task = _fetch_solr_rag(client, query, solr)
+    solr_chunks_task = _fetch_solr_rag(
+        client, query, solr, max_chunks=pool_size
+    )

Also applies to: 748-755

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/vector_search.py` around lines 642 - 685, The Solr fetch never
expands the rerank candidate pool because _fetch_solr_rag ignores its max_chunks
arg and _build_query_params still uses the default k; also build_rag_context
doesn't pass pool_size into _fetch_solr_rag. Fix by threading max_chunks
through: update _fetch_solr_rag to pass max_chunks (or computed limit) into the
params returned by _build_query_params (ensure _build_query_params accepts a
k/limit argument and sets params['k'] accordingly), call client.vector_io.query
with those params, and modify build_rag_context to call _fetch_solr_rag(...,
max_chunks=pool_size) so the Solr query and vector_io.query receive the larger
pool_size for reranking (refer to functions _fetch_solr_rag,
_build_query_params, build_rag_context, vector_io.query, and variables
max_chunks/pool_size).
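The pool-size threading described above ties back to the PR's `top_k_multiplier: 2.0` setting ("fetch 2x chunks, rerank, keep top_k"). A sketch of the arithmetic (`rerank_pool_size` is a hypothetical helper; whether the PR rounds up or truncates fractional pools is not shown in this excerpt):

```python
import math


def rerank_pool_size(top_k: int, multiplier: float = 2.0) -> int:
    """Candidate pool to fetch before reranking back down to top_k."""
    return max(top_k, math.ceil(top_k * multiplier))


print(rerank_pool_size(5))       # default 2.0 multiplier -> 10
print(rerank_pool_size(5, 1.5))  # fractional pools round up -> 8
```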
♻️ Duplicate comments (3)
src/utils/vector_search.py (2)

79-90: ⚠️ Potential issue | 🔴 Critical

Offload CrossEncoder.predict() off the event loop.

_rerank_chunks_with_cross_encoder() is async, but model.predict() is a synchronous inference call. Awaiting this path under load will stall the event loop for the full rerank duration and block unrelated requests.

♻️ Minimal fix
-        model = _get_cross_encoder(model_name)
+        model = await _get_cross_encoder(model_name)
@@
-        scores = model.predict(pairs)
+        scores = await asyncio.to_thread(model.predict, pairs)
According to the official sentence-transformers documentation, is `CrossEncoder.predict(...)` a synchronous/blocking call, and is `asyncio.to_thread(...)` the appropriate way to invoke it from Python async code?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/vector_search.py` around lines 79 - 90,
_rerank_chunks_with_cross_encoder is async but calls the synchronous
CrossEncoder.predict (via model.predict) which blocks the event loop; change the
call to run the blocking inference off the loop using asyncio.to_thread (or
loop.run_in_executor) so prediction runs in a worker thread and returns
awaitably. Locate the call to model.predict in _rerank_chunks_with_cross_encoder
and replace the direct call with an awaited asyncio.to_thread(lambda:
model.predict(pairs)) (or equivalent) so the CrossEncoder.predict invocation is
executed off the event loop.

34-56: ⚠️ Potential issue | 🟠 Major

Serialize cold-start loads for each cross-encoder model.

This cache is still racy: concurrent requests can all miss _cross_encoder_models and instantiate the same model before the first load stores it, causing an expensive CPU/RAM spike for large rerankers. Wrap model construction in a per-model single-flight lock so only one task loads each model name.

♻️ Suggested single-flight pattern
 _cross_encoder_models: dict[str, Any] = {}  # pylint: disable=invalid-name
+_cross_encoder_load_locks: dict[str, asyncio.Lock] = {}  # pylint: disable=invalid-name
 
 
-def _get_cross_encoder(model_name: str) -> Any:
+async def _get_cross_encoder(model_name: str) -> Any:
     """Return the lazy-loaded cross-encoder model for reranking.
@@
-    if model_name not in _cross_encoder_models:
-        try:
-            from sentence_transformers import (  # pylint: disable=import-outside-toplevel
-                CrossEncoder,
-            )
-
-            _cross_encoder_models[model_name] = CrossEncoder(model_name)
-            logger.info("Loaded cross-encoder for RAG reranking: %s", model_name)
-        except Exception as e:  # pylint: disable=broad-exception-caught
-            logger.warning(
-                "Could not load cross-encoder for reranking (%s): %s", model_name, e
-            )
-            _cross_encoder_models[model_name] = None
+    if model_name in _cross_encoder_models:
+        return _cross_encoder_models[model_name]
+
+    lock = _cross_encoder_load_locks.setdefault(model_name, asyncio.Lock())
+    async with lock:
+        if model_name in _cross_encoder_models:
+            return _cross_encoder_models[model_name]
+        try:
+            from sentence_transformers import (  # pylint: disable=import-outside-toplevel
+                CrossEncoder,
+            )
+
+            _cross_encoder_models[model_name] = await asyncio.to_thread(
+                CrossEncoder, model_name
+            )
+            logger.info("Loaded cross-encoder for RAG reranking: %s", model_name)
+        except Exception as e:  # pylint: disable=broad-exception-caught
+            logger.warning(
+                "Could not load cross-encoder for reranking (%s): %s", model_name, e
+            )
+            _cross_encoder_models[model_name] = None
     return _cross_encoder_models[model_name]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/vector_search.py` around lines 34 - 56, The _get_cross_encoder
function is racy because multiple callers can simultaneously miss
_cross_encoder_models and each instantiate the same heavy model; fix it by
introducing a per-model single-flight lock (e.g., a module-level dict
_cross_encoder_locks: Dict[str, threading.Lock]) and wrap the check-and-create
logic for a given model_name inside that lock: obtain or create the lock for
model_name (use setdefault), acquire it (with context manager) before
re-checking _cross_encoder_models[model_name], then instantiate CrossEncoder and
assign to _cross_encoder_models[model_name] inside the locked region (ensure
exceptions still set the cache to None and release the lock in finally). Update
references to _get_cross_encoder, _cross_encoder_models, and logger accordingly.
src/models/config.py (1)

1829-1861: ⚠️ Potential issue | 🟠 Major

_explicitly_configured never makes the auto-enable branch work as intended.

PrivateAttr(default=False) means _explicitly_configured is always present, so not hasattr(self.reranker, "_explicitly_configured") is always false and the auto-enable path never runs. The value-comparison validator also misses explicit input that matches defaults, e.g. reranker: { enabled: true }. Use model_fields_set to mark explicit input and test the flag value directly.

♻️ Suggested fix
 class RerankerConfiguration(ConfigurationBase):
@@
     _explicitly_configured: bool = PrivateAttr(default=False)
@@
    @model_validator(mode="after")
     def mark_as_explicitly_configured(self) -> Self:
         """Mark this configuration as explicitly set when instantiated from user input."""
-        # Only mark as explicitly configured if we're not using all default values
-        # This allows auto-enabling when user hasn't touched reranker settings
-        # Check if any field differs from default values
-        default_model = "cross-encoder/ms-marco-MiniLM-L6-v2"
-        default_top_k_multiplier = 2.0
-        default_byok_boost = 1.2
-        default_okp_boost = 1.0
-
-        # Check if any setting differs from defaults (indicates explicit configuration)
-        current_values = [
-            self.enabled,
-            self.model,
-            self.top_k_multiplier,
-            self.byok_boost,
-            self.okp_boost,
-        ]
-        default_values = [
-            True,
-            default_model,
-            default_top_k_multiplier,
-            default_byok_boost,
-            default_okp_boost,
-        ]
-
-        if current_values != default_values:
+        if self.model_fields_set:
             self._explicitly_configured = True
         return self
@@
-        if (
-            has_byok
-            and has_okp
-            and not hasattr(self.reranker, "_explicitly_configured")
-        ):
+        if has_byok and has_okp and not self.reranker._explicitly_configured:
             # pylint: disable=no-member
             if not self.reranker.enabled:
                 logger.info(
                     "Automatically enabling reranker: Both BYOK RAG (%d entries) and OKP "
                     "are configured. Reranking improves result quality when multiple "
In Pydantic v2, do `PrivateAttr(default=...)` attributes always exist on model instances, and does `model_fields_set` track fields explicitly provided in the input even when the provided value equals the field default?

Also applies to: 2154-2168

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/models/config.py` around lines 1829 - 1861, The _explicitly_configured
flag is never meaningful because PrivateAttr always exists; update the
mark_as_explicitly_configured validator (and the similar validator at lines
~2154-2168) to use model_fields_set to detect which fields were explicitly
provided and set self._explicitly_configured = True when any reranker-related
fields (e.g., "enabled", "model", "top_k_multiplier", "byok_boost", "okp_boost")
appear in self.model_fields_set or when the parent model passed reranker in its
input; also change any checks that use hasattr(self.reranker,
"_explicitly_configured") to inspect the flag's value directly
(self.reranker._explicitly_configured) so the auto-enable branch can run when
the user explicitly provided values even if they match defaults.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/utils/vector_search.py`:
- Around line 182-205: The loop that builds ReferencedDocument from rag_chunks
currently appends entries even when both title and URL are missing; update the
logic in the loop (the section that computes attrs, doc_url, parsed_url and
before result.append) to skip/continue when both attrs.get("title") is falsy and
parsed_url is falsy/None so you only append ReferencedDocument when there is a
doc_title or a doc_url; keep existing dedup_key/seen handling and AnyUrl parsing
but add a guard (using the computed title and parsed_url) to avoid creating
empty ReferencedDocument objects.

---

Outside diff comments:
In `@src/utils/vector_search.py`:
- Around line 642-685: The Solr fetch never expands the rerank candidate pool
because _fetch_solr_rag ignores its max_chunks arg and _build_query_params still
uses the default k; also build_rag_context doesn't pass pool_size into
_fetch_solr_rag. Fix by threading max_chunks through: update _fetch_solr_rag to
pass max_chunks (or computed limit) into the params returned by
_build_query_params (ensure _build_query_params accepts a k/limit argument and
sets params['k'] accordingly), call client.vector_io.query with those params,
and modify build_rag_context to call _fetch_solr_rag(..., max_chunks=pool_size)
so the Solr query and vector_io.query receive the larger pool_size for reranking
(refer to functions _fetch_solr_rag, _build_query_params, build_rag_context,
vector_io.query, and variables max_chunks/pool_size).

---

Duplicate comments:
In `@src/models/config.py`:
- Around line 1829-1861: The _explicitly_configured flag is never meaningful
because PrivateAttr always exists; update the mark_as_explicitly_configured
validator (and the similar validator at lines ~2154-2168) to use
model_fields_set to detect which fields were explicitly provided and set
self._explicitly_configured = True when any reranker-related fields (e.g.,
"enabled", "model", "top_k_multiplier", "byok_boost", "okp_boost") appear in
self.model_fields_set or when the parent model passed reranker in its input;
also change any checks that use hasattr(self.reranker, "_explicitly_configured")
to inspect the flag's value directly (self.reranker._explicitly_configured) so
the auto-enable branch can run when the user explicitly provided values even if
they match defaults.

In `@src/utils/vector_search.py`:
- Around line 79-90: _rerank_chunks_with_cross_encoder is async but calls the
synchronous CrossEncoder.predict (via model.predict) which blocks the event
loop; change the call to run the blocking inference off the loop using
asyncio.to_thread (or loop.run_in_executor) so prediction runs in a worker
thread and returns awaitably. Locate the call to model.predict in
_rerank_chunks_with_cross_encoder and replace the direct call with an awaited
asyncio.to_thread(lambda: model.predict(pairs)) (or equivalent) so the
CrossEncoder.predict invocation is executed off the event loop.
- Around line 34-56: The _get_cross_encoder function is racy because multiple
callers can simultaneously miss _cross_encoder_models and each instantiate the
same heavy model; fix it by introducing a per-model single-flight lock (e.g., a
module-level dict _cross_encoder_locks: Dict[str, threading.Lock]) and wrap the
check-and-create logic for a given model_name inside that lock: obtain or create
the lock for model_name (use setdefault), acquire it (with context manager)
before re-checking _cross_encoder_models[model_name], then instantiate
CrossEncoder and assign to _cross_encoder_models[model_name] inside the locked
region (ensure exceptions still set the cache to None and release the lock in
finally). Update references to _get_cross_encoder, _cross_encoder_models, and
logger accordingly.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 32b1995a-8892-4d14-a537-53427a6d3620

📥 Commits

Reviewing files that changed from the base of the PR and between 82caaf6 and d664680.

📒 Files selected for processing (7)
  • pyproject.toml
  • src/configuration.py
  • src/constants.py
  • src/models/config.py
  • src/utils/vector_search.py
  • tests/unit/models/config/test_reranker_configuration.py
  • tests/unit/utils/test_vector_search.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: build-pr
  • GitHub Check: unit_tests (3.13)
  • GitHub Check: unit_tests (3.12)
  • GitHub Check: integration_tests (3.12)
  • GitHub Check: integration_tests (3.13)
  • GitHub Check: Pylinter
  • GitHub Check: mypy
  • GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Use absolute imports for internal modules: from authentication import get_auth_dependency
Import FastAPI dependencies with: from fastapi import APIRouter, HTTPException, Request, status, Depends
Import Llama Stack client with: from llama_stack_client import AsyncLlamaStackClient
Check constants.py for shared constants before defining new ones
All modules start with descriptive docstrings explaining purpose
Use logger = get_logger(__name__) from log.py for module logging
Type aliases defined at module level for clarity
Use Final[type] as type hint for all constants
All functions require docstrings with brief descriptions
Complete type annotations for parameters and return types in functions
Use typing_extensions.Self for model validators in Pydantic models
Use modern union type syntax str | int instead of Union[str, int]
Use Optional[Type] for optional type hints
Use snake_case with descriptive, action-oriented function names (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead
Use async def for I/O operations and external API calls
Handle APIConnectionError from Llama Stack in error handling
Use standard log levels with clear purposes: debug, info, warning, error
All classes require descriptive docstrings explaining purpose
Use PascalCase for class names with standard suffixes: Configuration, Error/Exception, Resolver, Interface
Use ABC for abstract base classes with @abstractmethod decorators
Use @model_validator and @field_validator for Pydantic model validation
Complete type annotations for all class attributes; use specific types, not Any
Follow Google Python docstring conventions with Parameters, Returns, Raises, and Attributes sections

Files:

  • src/constants.py
  • src/configuration.py
  • tests/unit/models/config/test_reranker_configuration.py
  • src/models/config.py
  • tests/unit/utils/test_vector_search.py
  • src/utils/vector_search.py
**/constants.py

📄 CodeRabbit inference engine (AGENTS.md)

Central constants.py for shared constants with descriptive comments

Files:

  • src/constants.py
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Pydantic models extend ConfigurationBase for config, BaseModel for data models

Files:

  • src/constants.py
  • src/configuration.py
  • src/models/config.py
  • src/utils/vector_search.py
src/**/config*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/config*.py: All config uses Pydantic models extending ConfigurationBase
Base class sets extra="forbid" to reject unknown fields in Pydantic models
Use @field_validator and @model_validator for custom validation in Pydantic models
Use type hints like Optional[FilePath], PositiveInt, SecretStr in Pydantic models

Files:

  • src/configuration.py
  • src/models/config.py
tests/{unit,integration}/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/{unit,integration}/**/*.py: Use pytest for all unit and integration tests
Do not use unittest; pytest is the standard for this project
Use pytest-mock for AsyncMock objects in tests
Use marker pytest.mark.asyncio for async tests
Unit tests require 60% coverage, integration tests 10%

Files:

  • tests/unit/models/config/test_reranker_configuration.py
  • tests/unit/utils/test_vector_search.py
🧠 Learnings (9)
📚 Learning: 2026-04-19T15:40:25.624Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-04-19T15:40:25.624Z
Learning: Applies to **/*.py : Use Final[type] as type hint for all constants

Applied to files:

  • src/constants.py
📚 Learning: 2026-04-19T15:40:25.624Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-04-19T15:40:25.624Z
Learning: Applies to **/constants.py : Central `constants.py` for shared constants with descriptive comments

Applied to files:

  • src/constants.py
📚 Learning: 2026-03-17T11:34:53.242Z
Learnt from: are-ces
Repo: lightspeed-core/lightspeed-stack PR: 1335
File: docs/config.md:411-411
Timestamp: 2026-03-17T11:34:53.242Z
Learning: In the lightspeed-stack project (`src/models/config.py`, `docs/config.md`), the internal Solr filter `is_chunk:true` (defined as `SOLR_CHUNK_FILTER_QUERY` in `src/constants.py`) is always injected by the system for OKP searches and is intentionally hidden from users. The `chunk_filter_query` field in `OkpConfiguration` is user-facing and additive-only, but the documentation must NOT mention the internal `is_chunk:true` behavior — this is a deliberate design decision by the maintainers.

Applied to files:

  • src/constants.py
  • src/models/config.py
  • tests/unit/utils/test_vector_search.py
  • src/utils/vector_search.py
📚 Learning: 2026-04-19T15:40:25.624Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-04-19T15:40:25.624Z
Learning: Always check `pyproject.toml` for existing dependencies before adding new ones

Applied to files:

  • pyproject.toml
📚 Learning: 2026-04-19T15:40:25.624Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-04-19T15:40:25.624Z
Learning: Applies to src/**/config*.py : All config uses Pydantic models extending `ConfigurationBase`

Applied to files:

  • tests/unit/models/config/test_reranker_configuration.py
  • src/models/config.py
📚 Learning: 2026-04-19T15:40:25.624Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-04-19T15:40:25.624Z
Learning: Applies to src/**/*.py : Pydantic models extend `ConfigurationBase` for config, `BaseModel` for data models

Applied to files:

  • src/models/config.py
📚 Learning: 2026-01-12T10:58:40.230Z
Learnt from: blublinsky
Repo: lightspeed-core/lightspeed-stack PR: 972
File: src/models/config.py:459-513
Timestamp: 2026-01-12T10:58:40.230Z
Learning: In lightspeed-core/lightspeed-stack, for Python files under src/models, when a user claims a fix is done but the issue persists, verify the current code state before accepting the fix. Steps: review the diff, fetch the latest changes, run relevant tests, reproduce the issue, search the codebase for lingering references to the original problem, confirm the fix is applied and not undone by subsequent commits, and validate with local checks to ensure the issue is resolved.

Applied to files:

  • src/models/config.py
📚 Learning: 2026-02-25T07:46:33.545Z
Learnt from: asimurka
Repo: lightspeed-core/lightspeed-stack PR: 1211
File: src/models/responses.py:8-16
Timestamp: 2026-02-25T07:46:33.545Z
Learning: In the Python codebase, requests.py should use OpenAIResponseInputTool as Tool while responses.py uses OpenAIResponseTool as Tool. This difference is intentional due to differing schemas for input vs output tools in llama-stack-api. Apply this distinction consistently to other models under src/models (e.g., ensure request-related tools use the InputTool variant and response-related tools use the ResponseTool variant). If adding new tools, choose the corresponding InputTool or Tool class based on whether the tool represents input or output, and document the rationale in code comments.

Applied to files:

  • src/models/config.py
📚 Learning: 2026-01-30T13:33:40.479Z
Learnt from: are-ces
Repo: lightspeed-core/lightspeed-stack PR: 1073
File: src/app/endpoints/query_v2.py:575-586
Timestamp: 2026-01-30T13:33:40.479Z
Learning: In `src/app/endpoints/query_v2.py`, the `parse_referenced_documents_from_responses_api` function filters referenced documents to include only those with `doc_title` or `doc_url` because these documents are displayed in the frontend. Documents with only `doc_id` are intentionally excluded as they wouldn't provide useful information to end users.

Applied to files:

  • src/utils/vector_search.py

Comment on lines +182 to +205
```python
    for chunk in rag_chunks:
        attrs = chunk.attributes or {}
        doc_url = (
            attrs.get("reference_url") or attrs.get("doc_url") or attrs.get("docs_url")
        )
        doc_id = attrs.get("document_id") or attrs.get("doc_id")
        dedup_key = doc_url or doc_id or chunk.source or ""
        if not dedup_key or dedup_key in seen:
            continue
        seen.add(dedup_key)
        parsed_url: Optional[AnyUrl] = None
        if doc_url:
            try:
                parsed_url = AnyUrl(doc_url)
            except Exception:  # pylint: disable=broad-exception-caught
                parsed_url = None
        result.append(
            ReferencedDocument(
                doc_title=attrs.get("title"),
                doc_url=parsed_url,
                source=chunk.source,
            )
        )
    return result
```
Contributor

⚠️ Potential issue | 🟠 Major

Skip referenced documents that have neither title nor URL.

This helper now emits ReferencedDocument objects even when both doc_title and doc_url are missing. Those entries are intentionally filtered out elsewhere because the frontend cannot render them usefully, so this regresses the final referenced-doc list after reranking.

♻️ Suggested guard
     for chunk in rag_chunks:
         attrs = chunk.attributes or {}
         doc_url = (
             attrs.get("reference_url") or attrs.get("doc_url") or attrs.get("docs_url")
         )
+        title = attrs.get("title")
         doc_id = attrs.get("document_id") or attrs.get("doc_id")
         dedup_key = doc_url or doc_id or chunk.source or ""
         if not dedup_key or dedup_key in seen:
             continue
         seen.add(dedup_key)
@@
-        result.append(
-            ReferencedDocument(
-                doc_title=attrs.get("title"),
-                doc_url=parsed_url,
-                source=chunk.source,
-            )
-        )
+        if title is None and parsed_url is None:
+            continue
+        result.append(
+            ReferencedDocument(
+                doc_title=title,
+                doc_url=parsed_url,
+                source=chunk.source,
+            )
+        )

Based on learnings: In src/app/endpoints/query_v2.py, the parse_referenced_documents_from_responses_api function filters referenced documents to include only those with doc_title or doc_url because these documents are displayed in the frontend.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/vector_search.py` around lines 182 - 205, The loop that builds
ReferencedDocument from rag_chunks currently appends entries even when both
title and URL are missing; update the logic in the loop (the section that
computes attrs, doc_url, parsed_url and before result.append) to skip/continue
when both attrs.get("title") is falsy and parsed_url is falsy/None so you only
append ReferencedDocument when there is a doc_title or a doc_url; keep existing
dedup_key/seen handling and AnyUrl parsing but add a guard (using the computed
title and parsed_url) to avoid creating empty ReferencedDocument objects.

Signed-off-by: Anxhela Coba <acoba@redhat.com>
Contributor

@are-ces are-ces left a comment

The example config in the PR description must be outdated because

top_k_multiplier: 2.0 # Fetch 2x chunks, rerank, keep top_k
byok_boost: 1.2 # Boost factor for BYOK sources
okp_boost: 1.0 # Boost factor for OKP sources

are missing from the configuration in the PR.

Would be nice to see a configured reranker in the example configuration examples/lightspeed-stack-byok-okp-rag.yaml.

Please update the unit and integration tests, and fix the failing Black and linting checks. TY!
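A configured reranker section in examples/lightspeed-stack-byok-okp-rag.yaml could look like the sketch below, assembled from the fields shown in the PR description; the exact key names and nesting are assumptions until the final schema lands:

```yaml
reranker:
  enabled: true
  model: "cross-encoder/ms-marco-MiniLM-L6-v2"
  top_k_multiplier: 2.0  # fetch 2x chunks, rerank, keep top_k
  byok_boost: 1.2        # boost factor for BYOK sources
  okp_boost: 1.0         # boost factor for OKP sources
```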


# Create query-chunk pairs for scoring
pairs = [(query, chunk.content) for chunk in chunks]
scores = model.predict(pairs)
Contributor

As CodeRabbit points out, this call is synchronous and blocks the event loop; please make it async.
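One way to do that, sketched under the assumption that `model` is any CrossEncoder-like object with a synchronous `predict(pairs)` method, is to offload the blocking call to a worker thread with `asyncio.to_thread` (Python 3.9+) so the event loop stays responsive:

```python
import asyncio

async def rerank_chunks(model, query, chunks):
    # Create query-chunk pairs for scoring, as in the original snippet
    pairs = [(query, chunk.content) for chunk in chunks]
    # Run the synchronous predict() in a worker thread, off the event loop
    scores = await asyncio.to_thread(model.predict, pairs)
    # Highest relevance score first
    return sorted(zip(chunks, scores), key=lambda cs: cs[1], reverse=True)
```

This keeps the scoring code unchanged while making the coroutine awaitable alongside the other RAG fetch tasks.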

solr_chunks_task = _fetch_solr_rag(client, query, solr)

-    (byok_chunks, byok_docs), (solr_chunks, solr_docs) = await asyncio.gather(
+    (byok_chunks, _), (solr_chunks, _) = await asyncio.gather(
Contributor

The new _referenced_documents_from_rag_chunks (line 169) duplicates the same extraction logic already present in _process_byok_rag_chunks_for_documents and _process_solr_chunks_for_documents.

On top of that, those two per-source functions are effectively dead code in the current flow, their return values are discarded on line 754:

(byok_chunks, _), (solr_chunks, _) = await asyncio.gather(
    byok_chunks_task, solr_chunks_task
)

...and then _referenced_documents_from_rag_chunks re-extracts the same documents from the final chunks at line 785.

Suggestion: Instead of discarding the per-source documents and re-extracting them, keep the documents returned by _fetch_byok_rag and _fetch_solr_rag, then after reranking, filter/reorder them to match whichever chunks survived the rerank. This avoids duplicating the extraction logic and is more efficient — one lookup pass instead of a second extraction. _referenced_documents_from_rag_chunks can then be removed.
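The filter/reorder step suggested above could look like the following sketch; the `source` attribute as a join key between chunks and documents is an assumption based on the PR excerpts, and the dataclasses are stand-ins for the project's real models:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    source: str
    content: str

@dataclass
class Doc:
    source: str
    title: str

def documents_for_surviving_chunks(reranked_chunks, documents):
    # One lookup pass: map each surviving source to its rerank position
    order = {chunk.source: i for i, chunk in enumerate(reranked_chunks)}
    # Keep only documents whose chunks survived, in rerank order
    kept = [doc for doc in documents if doc.source in order]
    kept.sort(key=lambda doc: order[doc.source])
    return kept
```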

Comment thread src/models/config.py
# pylint: disable=no-member
has_okp = (
    constants.OKP_RAG_ID in self.rag.inline
    or constants.OKP_RAG_ID in self.rag.tool
Contributor

Why should the reranker be active if OKP is in RAG as a Tool?

Contributor Author

@Anxhela21 Anxhela21 Apr 29, 2026

This checks whether OKP is enabled either inline or as a tool, since we can configure OKP either way, unless I'm mistaken?

Comment thread src/models/config.py
Self: The validated configuration instance with reranker potentially enabled.
"""
# Check if BYOK RAG entries are configured
has_byok = len(self.byok_rag) > 0
Contributor

Should also check that there are any entries registered in inline RAG

Contributor Author

Oh good point!
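The amended auto-enable check could be sketched like this; the standalone function, parameter names, and the "ocp-docs" default ID are illustrative assumptions, not the PR's actual code:

```python
def should_auto_enable_reranker(byok_rag, inline, tool, okp_rag_id="ocp-docs"):
    # Non-OKP sources: explicit BYOK entries or any other inline RAG registration
    has_sources = len(byok_rag) > 0 or any(r != okp_rag_id for r in inline)
    # OKP registered either inline or as a tool
    has_okp = okp_rag_id in inline or okp_rag_id in tool
    # Reranking only pays off when both kinds of sources are present
    return has_sources and has_okp
```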

Returns:
    RAGContext containing formatted context text and referenced documents
"""
if moderation_decision == "blocked":
Contributor

Why was this removed? This if was there to prevent doing RAG in case of an unsafe input

client: AsyncLlamaStackClient,
query: str,
solr: Optional[SolrVectorSearchRequest] = None,
max_chunks: Optional[int] = None,
Contributor

This parameter is not used

byok_chunks_task = _fetch_byok_rag(
    client, query, vector_store_ids, max_chunks=pool_size
)
solr_chunks_task = _fetch_solr_rag(client, query, solr)
Contributor

Why are we not increasing the pool size for OKP chunks? I might be missing something here...

"""
if model_name not in _cross_encoder_models:
    try:
        from sentence_transformers import (  # pylint: disable=import-outside-toplevel
Contributor

This import should be moved to the top of the module.
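With the heavyweight import hoisted to module top, the per-model cache can stay very small. A sketch, with the loader injected so the pattern is testable (in the PR it would be `sentence_transformers.CrossEncoder`, and the cache name mirrors the excerpt's `_cross_encoder_models`):

```python
from typing import Callable, Dict

_model_cache: Dict[str, object] = {}

def get_cached_model(model_name: str, loader: Callable[[str], object]) -> object:
    # Load each cross-encoder once and reuse it across requests
    if model_name not in _model_cache:
        _model_cache[model_name] = loader(model_name)
    return _model_cache[model_name]
```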

Comment thread pyproject.toml
# To be able to fix multiple CVEs, also LCORE-1117
"requests>=2.33.0",
# Used for RAG chunk reranking (cross-encoder)
"sentence-transformers>=2.0.0",
Contributor

Should the sentence-transformer dependency be added @syedriko @tisnik? In that case uv.lock will need to be updated too

Contributor

yes, it needs to be updated to pass CI etc.

Contributor Author

thanks missed that - will update uv.lock

Comment thread src/models/config.py
class RerankerConfiguration(ConfigurationBase):
    """Reranker configuration for RAG chunk reranking."""

    enabled: bool = True
Contributor

nit: can be initialized as Field so it would be possible to specify title and description.
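A sketch of that nit, assuming pydantic v2; `BaseModel` stands in for the project's `ConfigurationBase`, and the title/description strings are illustrative:

```python
from pydantic import BaseModel, Field

class RerankerConfiguration(BaseModel):
    """Reranker configuration for RAG chunk reranking."""

    enabled: bool = Field(
        default=True,
        title="Enable reranker",
        description="Rerank merged RAG chunks with a cross-encoder model.",
    )
```

Declaring the flag via `Field` surfaces the title and description in the generated JSON schema and docs.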
